feat: support a different timeout for the last replica #1176

Merged: 3 commits merged into longhorn:master on Aug 23, 2024

Conversation

@ejweber (Contributor) commented on Aug 6, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#8711

What this PR does / why we need it:

The implementation in this PR makes it possible to configure the engine so that there are two engineReplicaTimeouts (short and long). The backends lightly coordinate via a new SharedTimeouts struct to ensure that most of them can time out in the normal way after engineReplicaTimeoutShort, but exactly one of them must wait engineReplicaTimeoutLong to do the same.
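
To illustrate the idea, here is a minimal sketch of the coordination under assumed names (this is not the actual SharedTimeouts API added by this PR): a mutex-guarded counter lets every backend except the last remaining one time out after the short timeout, while the last one may only time out after the long timeout.

package main

import (
	"fmt"
	"sync"
	"time"
)

// sharedTimeouts is a hypothetical sketch of the coordination described above.
// Field and method names are illustrative only.
type sharedTimeouts struct {
	mu        sync.Mutex
	short     time.Duration // engineReplicaTimeoutShort
	long      time.Duration // engineReplicaTimeoutLong
	remaining int           // backends that have not yet timed out
}

// checkTimeout returns the timeout that applies to a backend whose last
// activity was elapsed ago, or zero if the backend should not time out yet.
func (s *sharedTimeouts) checkTimeout(elapsed time.Duration) time.Duration {
	s.mu.Lock()
	defer s.mu.Unlock()

	// Every backend except the last remaining one may use the short timeout.
	if s.remaining > 1 && elapsed >= s.short {
		s.remaining--
		return s.short
	}
	// The last remaining backend must wait for the long timeout.
	if elapsed >= s.long {
		s.remaining--
		return s.long
	}
	return 0
}

func main() {
	s := &sharedTimeouts{short: 8 * time.Second, long: 16 * time.Second, remaining: 3}
	fmt.Println(s.checkTimeout(9 * time.Second))  // 8s: a non-last backend times out
	fmt.Println(s.checkTimeout(9 * time.Second))  // 8s: another non-last backend times out
	fmt.Println(s.checkTimeout(9 * time.Second))  // 0s: the last backend must keep waiting
	fmt.Println(s.checkTimeout(17 * time.Second)) // 16s: the last backend finally times out
}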

Note that this PR does NOT actually configure a different engineReplicaTimeoutLong. My plan is to do that in a followup after this one is approved and we decide exactly how we want to expose the new capability.

Special notes for your reviewer:

Additional documentation or context:

Per #1176 (comment), I experimented with a different approach in https://github.com/ejweber/longhorn-engine/tree/8711-last-replica-timeout-previous-attempt. That one didn't work well due to lock contention between I/O operations, replica error handling, and the new logic.

@ejweber ejweber changed the title 8711 last replica timeout Support a different timeout for the last replica Aug 6, 2024
@ejweber ejweber changed the title Support a different timeout for the last replica feat: support a different timeout for the last replica Aug 6, 2024
@ejweber (Contributor, Author) commented on Aug 7, 2024

UPDATE: Keeping this comment for historical purposes, but it is no longer relevant. The approach in this PR is now quite different.

I had to enable pprof on the engine to understand why this wasn't working the way I wanted (up until at least b8e8d7a). The reason is this:

  • I put a loop in the controller that checks every second whether any existing backend has exceeded the "short" timeout.
  • This check requires a controller RLock for safety.
  • Once the controller identifies a backend that should time out, it notifies the backend via a channel.
  • The backend responds by "timing itself out" in the way Longhorn backends usually do.
  • The controller should handle this by setting the backend to ERR, but this requires a Lock.
  • HOWEVER, the pending writes already all have an RLock.

In a test case in which I blocked communication with all replicas simultaneously:

  • A bunch of writes held an RLock.
  • After the new logic failed the first backend, setting it to ERR waited on a Lock until the writes completed.
  • The new logic couldn't fail any more backends until it obtained an RLock.
  • The writes never finished (and never released their RLock), because there were still unfailed backends.
12 @ 0x441d4e 0x454bc5 0x454b94 0x475845 0x498b88 0xa7ad2a 0xa7dfff 0xa7703d 0xa77018 0xa6e212 0xa5a15c 0x479861
#	0x475844	sync.runtime_Semacquire+0x24									/usr/lib64/go/1.22/src/runtime/sema.go:62
#	0x498b87	sync.(*WaitGroup).Wait+0x47									/usr/lib64/go/1.22/src/sync/waitgroup.go:116
#	0xa7ad29	github.com/longhorn/longhorn-engine/pkg/controller.(*MultiWriterAt).WriteAt+0x1e9		/go/src/github.com/longhorn/longhorn-engine/pkg/controller/multi_writer_at.go:52
#	0xa7dffe	github.com/longhorn/longhorn-engine/pkg/controller.(*replicator).WriteAt+0x3e			/go/src/github.com/longhorn/longhorn-engine/pkg/controller/replicator.go:129
# ----> The rest of the stack is called with an RLock.
#	0xa7703c	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).writeInNormalMode+0x21c	/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:1095
#	0xa77017	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).WriteAt+0x1f7			/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:1050
#	0xa6e211	github.com/longhorn/longhorn-engine/pkg/frontend/socket.DataProcessorWrapper.WriteAt+0x31	/go/src/github.com/longhorn/longhorn-engine/pkg/frontend/socket/frontend.go:160
#	0xa5a15b	github.com/longhorn/longhorn-engine/pkg/dataconn.(*Server).handleWrite+0x3b			/go/src/github.com/longhorn/longhorn-engine/pkg/dataconn/server.go:88

1 @ 0x441d4e 0x454bc5 0x454b94 0x475965 0xa7208f 0xa7206b 0xaa93cd 0xaa9693 0x9ce2ce 0xaa9338 0x9ce123 0x9b45f8 0x9b94cb 0x9b26ab 0x479861
#	0x475964	sync.runtime_SemacquireRWMutexR+0x24																			/usr/lib64/go/1.22/src/runtime/sema.go:82
#	0xa7208e	sync.(*RWMutex).RLock+0x4e																				/usr/lib64/go/1.22/src/sync/rwmutex.go:70
#	0xa7206a	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).GetExpansionErrorInfo+0x2a												/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:427
#	0xaa93cc	github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).getVolume+0x2c												/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:85
#	0xaa9692	github.com/longhorn/longhorn-engine/pkg/controller/rpc.(*ControllerServer).VolumeGet+0x12												/go/src/github.com/longhorn/longhorn-engine/pkg/controller/rpc/server.go:122
#	0x9ce2cd	github.com/longhorn/types/pkg/generated/enginerpc._ControllerService_VolumeGet_Handler.func1+0xcd											/go/src/github.com/longhorn/longhorn-engine/vendor/github.com/longhorn/types/pkg/generated/enginerpc/controller_grpc.pb.go:391
#	0xaa9337	github.com/longhorn/longhorn-engine/pkg/controller/rpc.GetControllerGRPCServer.WithIdentityValidationControllerServerInterceptor.identityValidationServerInterceptor.func1+0x797	/go/src/github.com/longhorn/longhorn-engine/pkg/interceptor/interceptor.go:64
#	0x9ce122	github.com/longhorn/types/pkg/generated/enginerpc._ControllerService_VolumeGet_Handler+0x142												/go/src/github.com/longhorn/longhorn-engine/vendor/github.com/longhorn/types/pkg/generated/enginerpc/controller_grpc.pb.go:393
#	0x9b45f7	google.golang.org/grpc.(*Server).processUnaryRPC+0xdf7																	/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1379
#	0x9b94ca	google.golang.org/grpc.(*Server).handleStream+0xe8a																	/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1790
#	0x9b26aa	google.golang.org/grpc.(*Server).serveStreams.func2.1+0x8a																/go/src/github.com/longhorn/longhorn-engine/vendor/google.golang.org/grpc/server.go:1029

1 @ 0x441d4e 0x454bc5 0x454b94 0x475965 0xa79a1b 0xa799f4 0xa798b2 0x479861
#	0x475964	sync.runtime_SemacquireRWMutexR+0x24									/usr/lib64/go/1.22/src/runtime/sema.go:82
#	0xa79a1a	sync.(*RWMutex).RLock+0xda										/usr/lib64/go/1.22/src/sync/rwmutex.go:70
#	0xa799f3	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).checkBackendTimeouts+0xb3		/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:1446
#	0xa798b1	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).monitorBackendTimeouts.func1+0x111	/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:1437

1 @ 0x441d4e 0x454bc5 0x454b94 0x4759c5 0x49878a 0xa72d79 0xa7887a 0x479861
#	0x4759c4	sync.runtime_SemacquireRWMutex+0x24							/usr/lib64/go/1.22/src/runtime/sema.go:87
#	0x498789	sync.(*RWMutex).Lock+0x69								/usr/lib64/go/1.22/src/sync/rwmutex.go:151
#	0xa72d78	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).SetReplicaMode+0x178	/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:514
#	0xa78879	github.com/longhorn/longhorn-engine/pkg/controller.(*Controller).monitoring+0x199	/go/src/github.com/longhorn/longhorn-engine/pkg/controller/control.go:1280

NOTE: Go's locks are write-biased. Once a writer attempts to acquire the lock, no new readers can acquire it until after the writer has.
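
The following standalone sketch, unrelated to the engine code itself, demonstrates that behavior: once a writer is queued on Lock, a new RLock call blocks behind it even though an earlier reader still holds the lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // an existing reader (think: in-flight writes holding the controller RLock)

	go func() {
		mu.Lock() // a writer (think: setting a backend to ERR) queues up behind the reader
		defer mu.Unlock()
		fmt.Println("writer ran")
	}()
	time.Sleep(100 * time.Millisecond) // give the writer time to queue up

	done := make(chan struct{})
	go func() {
		mu.RLock() // a new reader (think: the timeout check loop) blocks behind the queued writer
		defer mu.RUnlock()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("new reader acquired the RLock immediately")
	case <-time.After(time.Second):
		fmt.Println("new reader is blocked behind the queued writer")
	}

	mu.RUnlock() // releasing the original reader lets the writer run, then the new reader
	<-done
	fmt.Println("new reader acquired the RLock only after the writer finished")
}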

@ejweber ejweber force-pushed the 8711-last-replica-timeout branch 4 times, most recently from 828f7aa to 315faca, on August 8, 2024 at 22:04
@ejweber ejweber marked this pull request as ready for review on August 8, 2024 at 22:08
@james-munson (Contributor) previously approved these changes on Aug 12, 2024 and left a comment:

LGTM.

Commits:
  • Longhorn 8711 (Signed-off-by: Eric Weber <eric.weber@suse.com>)
  • Longhorn 8711 (Signed-off-by: Eric Weber <eric.weber@suse.com>)
  • …outShort (Longhorn 8711; Signed-off-by: Eric Weber <eric.weber@suse.com>)
@ejweber (Contributor, Author) commented on Aug 21, 2024

I ended up making the long timeout double the short one and not adjusting the potential range of the short one. This makes the timeouts 8s / 16s out of the box, with the ability to increase them to 30s / 60s, subject to the caveat from longhorn/longhorn#8711 (comment).
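
As a rough sketch of the resulting values (variable names here are illustrative, not the engine's setting names):

package main

import (
	"fmt"
	"time"
)

func main() {
	short := 8 * time.Second // default short timeout; it can be raised to at most 30s
	long := 2 * short        // the long timeout is always derived as double the short one
	fmt.Println(short, long) // 8s 16s by default, 30s / 60s at the maximum setting
}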

@james-munson (Contributor) left a comment:

I'll buy that.

@derekbit (Member) left a comment:

LGTM. Only two comments.

pkg/backend/remote/remote.go (review thread resolved)
pkg/dataconn/client.go (review thread resolved)
@PhanLe1010 (Contributor) left a comment:

LGTM

Thank you for the good additional refactoring of the timeOfLastActivity logic. It is easier to understand than the previous logic.

@mergify mergify bot merged commit 405e96f into longhorn:master Aug 23, 2024
12 checks passed
@derekbit (Member) commented:

@mergify backport v1.7.x v1.6.x

mergify bot commented on Aug 23, 2024

backport v1.7.x v1.6.x

✅ Backports have been created
